ABSTRACT

In most low-power VLSI designs, the supply voltage is reduced to lower the total power
consumption. However, device speed degrades as the supply voltage goes down. In this paper,
we propose new algorithmic-level techniques, based on the multirate approach, for compensating
the increased delays. We present two methods, the Chebyshev polynomial approach and the
polyphase decomposition approach, for designing low-power but high-speed transform coding
architectures. We show how to compute the discrete cosine transform (DCT) and its inverse (IDCT)
from decimated low-speed sequences with a reasonable, linear hardware overhead. When the
decimation factor equals two, the overall power consumption can be reduced to about one-third
of that of the original design at the architectural level. Extending the design to higher
decimation rates is also achievable and can result in even lower power consumption. The resulting
multirate low-power architectures are regular, modular, and free of global communication.
Moreover, the compensation capability is achieved at the expense of only locally increased
hardware and data paths. As a consequence, the architectures are very suitable for VLSI
implementation. They can also be applied to very high-speed block transforms where only
low-speed operators are required. Extensions of the algorithm-based low-power design, such as
the unified transform architecture and the finite-wordlength effects of the design, are
discussed in the companion paper.


This  work  was  supported  in  part  by  the  ONR  grant  N00014-93-10566  and  the  NSF  NYI  Award 
MIP9457397. 
1  Introduction

Recent developments in personal communications services (PCS) have made it possible to
integrate voice, image, and cellular phone networks in a personal communicator. Because of the
limited power-supply capability of current battery technology, the power constraint is an
important consideration in the design of PCS devices. It has been shown that reducing the supply
voltage is the decision with the greatest leverage for lowering power consumption. However, the
devices (operators) suffer a speed penalty as the supply voltage goes down [1]. To meet the
low-power/high-throughput constraint, the key issue is to "compensate" the increased delay so
that the devices can be operated at the slowest possible speed while maintaining the same data
sample rate. In [1], the techniques of "parallel processing" and "pipelining" were suggested to
compensate the speed penalty, and a simple comparator circuit was used to demonstrate how
parallel independent processing of the data can achieve good compensation at the architectural
level. In most digital signal processing (DSP) applications, however, the problems encountered
are much more complex, and it is almost impossible to decompose them directly into independent
parallel tasks. Therefore, the properties of the DSP algorithms should be fully exploited to
develop efficient techniques that compensate the loss of performance under low-power operation.
The main issue is to reformulate the algorithms so that the desired output can be obtained
without hindering system performance, such as the data throughput rate. We call such an approach
the algorithm-based low-power design. The ultimate goal is to meet the low-power design
requirement only at the expense of larger chip area under current technology, without resorting
to dedicated arithmetic circuit design, new expensive device materials, or advanced VLSI
fabrication technology.

In this paper, we show how to design algorithm-based low-power architectures for transform
coding. An algorithm-based compensation technique using the multirate approach is proposed to
reduce the power consumption. To motivate the idea, consider the discrete cosine transform
(DCT) architecture in Fig. 1. For most of the existing serial-input-parallel-output (SIPO) DCT
algorithms and architectures [2] [3], the processing rate must be as fast as the input data rate
(Fig. 1(a)). In our low-power design, the DCT is computed by a reformulated circuit that uses the
decimated sequences (Fig. 1(b)). It is now a multirate system that operates at two different
sample rates. Since the operating speed of the processing elements is reduced to half of the
original data rate while the data throughput rate is still maintained, the speed penalty is
compensated at the architectural level. As to the power consumption, using the CMOS power
dissipation model [1], we can predict that the overall power consumption of the multirate design
can be reduced to about one-third of that of the original system. Therefore, the downsampling
scheme provides a direct and efficient way to achieve low-power design at the
algorithmic/architectural level.

Two different approaches to multirate low-power transform kernel design are presented in
this paper. One is based on the properties of the Chebyshev polynomial. The Chebyshev polynomial
derivation of the DCT/IDCT algorithm was first proposed in [4]. However, the architecture in
[4] needs global communication and requires O(N log N) multipliers. In our work, we treat the
transforms as the evaluation of a Chebyshev series. By exploiting the recurrence property of the
Chebyshev polynomial, we can compute the DCT/IDCT from the decimated sequences with a linear
increase of hardware complexity; hence the speed penalty can be compensated. The other approach
is based on the polyphase decomposition, which is an effective tool in multirate signal
processing [5]. By applying the polyphase decomposition to the IIR transfer functions of the
DCT/IDCT [3], we can also perform the DCT/IDCT using the multirate approach.

A major advantage of our multirate low-power architectures is that they inherit all advantages
of the SIPO transform architectures discussed in [2] and [3], such as local communication,
regularity, modularity, and linear hardware complexity. This makes the proposed architectures
very suitable for VLSI implementation. Also, the power consumption in the routing and layout of
the chip can be minimized. Moreover, unlike most parallel-input-parallel-output (PIPO) transform
architectures, the speed-compensation capability of our architectures is achieved at the expense
of "locally" increased hardware complexity and routing paths. Since the topological property
(communication between chip modules) is usually given the highest priority in VLSI system
design [6, chap. 8], this feature of local interconnection and local hardware overhead is
especially preferable when the transformation size is large (e.g., the MPEG audio codec, in
which a 32-point DCT/IDCT is used [7]).

In the companion paper [8], we will extend the low-power design to a larger class of transforms,
which leads to a unified transformation architecture. Some design issues, such as reducing the
complexity of the compensation technique and the finite-wordlength behavior of the low-power
design, will also be considered. Those analyses provide more insight into the algorithm-based
low-power design.

The organization of this paper is as follows. The derivation of the low-power IDCT/DCT
algorithms and architectures based on the Chebyshev polynomial is described in Section 2. The
multirate IIR DCT/IDCT structures using the polyphase decomposition are presented in Section 3.
The comparison of both low-power designs with other approaches is discussed in Section 4,
followed by a conclusion.

2  The  Chebyshev  Polynomial  Approach 

The $n$th-order Chebyshev polynomial is defined as [9, chap. 1]

$$T_n(\eta) = \cos(n\omega), \quad \cos\omega = \eta, \quad -1 \le \eta \le 1, \tag{1}$$

which can be generated from the "three-term recurrence" formula

$$T_{n+1}(\eta) = 2\eta\,T_n(\eta) - T_{n-1}(\eta) \tag{2}$$

with the initial conditions $T_0(\eta) = 1$, $T_1(\eta) = \eta$. Now consider the following Chebyshev series

$$Y_c(\eta) = \frac{1}{2}a_0 + \sum_{k=1}^{N-1} a_k\cos(k\omega) = \frac{1}{2}a_0 + \sum_{k=1}^{N-1} a_k T_k(\eta), \tag{3}$$

where $a_k$, $k = 0, 1, \ldots, N-1$, are constant coefficients. One efficient way to evaluate
$Y_c(\eta)$ for a given value $\eta$ is Clenshaw's algorithm [9, chap. 3] [10, chap. 4], in which
a "backward recurrence sequence" is defined as

$$b_k(\eta) = a_k + 2\eta\,b_{k+1}(\eta) - b_{k+2}(\eta) \quad \text{for } k = N-1, \ldots, 1, 0 \tag{4}$$

with the initial conditions $b_N(\eta) = b_{N+1}(\eta) = 0$. After substituting (4) into (3) and
applying the recurrence formula in (2), we can simplify the evaluation of $Y_c(\eta)$ as

$$Y_c(\eta) = {\sum_{k=0}^{N-1}}' \left[b_k(\eta) - 2\eta\,b_{k+1}(\eta) + b_{k+2}(\eta)\right] T_k(\eta) = \frac{b_0(\eta) - b_2(\eta)}{2}, \tag{5}$$

where the prime on the summation indicates that the $k = 0$ term is halved, as in (3).
Later in the DCT/IDCT, we will need the evaluation of

$$Y'_c(\eta) = \sum_{k=0}^{N-1} a_k\cos(k\omega) = \sum_{k=0}^{N-1} a_k T_k(\eta). \tag{6}$$

It can be seen that by scaling $a_0$ by 2 (a left shift) beforehand, we can evaluate $Y'_c(\eta)$
through the same steps in (4)-(5). The corresponding architecture to evaluate $Y'_c(\eta)$ is
shown in Fig. 2, where $a_0$ has been pre-scaled by two. Since the $b_i$'s are generated in a
"backward" manner, the input sequence is in reverse order. The second-order recurrence structure
in the middle computes the $b_i$'s according to (4). After the last input is fed into the system,
$b_0(\eta)$ and $b_2(\eta)$ will be available and $Y'_c(\eta)$ can be evaluated from (5) with one
addition and one right-shift operation.
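The evaluation steps in (4)-(6) can be sketched in a few lines of Python. This is only an illustrative software model of the recurrence, not the hardware of Fig. 2; the function name `clenshaw_cosine_series` is a hypothetical choice made here.

```python
import math

def clenshaw_cosine_series(a, eta):
    """Evaluate Y'_c(eta) = sum_{k=0}^{N-1} a[k] * T_k(eta) using (4)-(5)."""
    if not a:
        return 0.0
    b1 = b2 = 0.0                      # b_{k+1}, b_{k+2}; b_N = b_{N+1} = 0
    # Coefficients are consumed in reverse order: a_{N-1}, ..., a_1.
    for k in range(len(a) - 1, 0, -1):
        b1, b2 = a[k] + 2.0 * eta * b1 - b2, b1
    # Last input: a_0 pre-scaled by two, so that (5) yields the unhalved series (6).
    b0 = 2.0 * a[0] + 2.0 * eta * b1 - b2
    return 0.5 * (b0 - b2)             # one subtraction and one right shift
```

Each loop iteration costs one multiplication by the fixed value `eta` and two additions, which is the second-order recurrence the text attributes to the middle part of the structure.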

Two other Chebyshev polynomial properties that will be useful in later derivations are
[9, chap. 3]:

1.  Composition property:

$$T_s(T_r(\eta)) = T_r(T_s(\eta)) = T_{rs}(\eta), \tag{7}$$

which allows us to represent a higher-order Chebyshev polynomial using lower-order ones, and
vice versa.

2.  Product-sum relationship:

$$T_s(\eta)\,T_r(\eta) = \frac{1}{2}\left(T_{s+r}(\eta) + T_{s-r}(\eta)\right), \tag{8}$$

which shows that the product of two Chebyshev polynomials can be decomposed into the sum
of two Chebyshev polynomials, and vice versa.

2.1  Chebyshev  IDCT  Architecture 

In order to illustrate the relationship between the Chebyshev polynomial and the transforms, we
will begin with the derivation of the IDCT algorithm. Let $X(k)$, $k = 0, 1, \ldots, N-1$, be a
DCT-domain sequence. The block IDCT to compute the time-domain sequence $x(n)$,
$n = 0, 1, \ldots, N-1$, is defined as

$$x(n) = \sum_{k=0}^{N-1} C(k)\,X(k)\cos\left[\frac{(2n+1)\pi k}{2N}\right], \tag{9}$$

where

$$C(k) = \begin{cases} \cdots, & \text{if } k = 0 \\ \cdots, & \text{otherwise} \end{cases} \tag{10}$$

is the scaling factor used in the DCT/IDCT. If we define

$$\omega_n \triangleq \frac{(2n+1)\pi}{2N} \tag{11}$$

and  use  the  definition  of  the  Chebyshev  polynomial  in  (1),  (9)  can  be  written  as 

$$x(n) = \sum_{k=0}^{N-1} C(k)\,X(k)\cos(k\omega_n) = \sum_{k=0}^{N-1} X'(k)\,T_k(\eta_n) \tag{12}$$

where

$$\eta_n = \cos\omega_n \tag{13}$$

and $X'(k) = C(k)X(k)$ is the scaled input data. Comparing (12) with (6), we see that the IDCT
with index $n$ can be treated as the evaluation of a Chebyshev series at $\eta_n$ with
coefficients $X'(k)$, $k = 0, 1, \ldots, N-1$. As a consequence, the recursive architecture in
Fig. 2 can perform the IDCT at center frequency $\omega_n$ if we replace the multiplier
coefficient $\eta$ with $\eta_n$.
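As a rough software analogue of (12)-(13), the sketch below runs one Clenshaw recurrence per output index $n$ on the scaled inputs $X'(k)$. The helper `scale` is an assumption made here for illustration (orthonormal DCT-II scaling); the recurrence itself does not depend on the particular values of $C(k)$.

```python
import math

def scale(k, n):
    # ASSUMPTION (illustration only): orthonormal DCT-II scaling factor C(k).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def idct_module(x_scaled, eta_n):
    """One IDCT module: Clenshaw recurrence (4)-(5) on X'(k), evaluated at eta_n."""
    b1 = b2 = 0.0
    for k in range(len(x_scaled) - 1, 0, -1):      # inputs arrive in reverse order
        b1, b2 = x_scaled[k] + 2.0 * eta_n * b1 - b2, b1
    b0 = 2.0 * x_scaled[0] + 2.0 * eta_n * b1 - b2  # X'(0) pre-scaled by two
    return 0.5 * (b0 - b2)

def chebyshev_idct(X):
    """Block IDCT (9) computed as N Chebyshev-series evaluations, eq. (12)."""
    n = len(X)
    x_scaled = [scale(k, n) * X[k] for k in range(n)]
    return [idct_module(x_scaled, math.cos((2 * m + 1) * math.pi / (2 * n)))
            for m in range(n)]

def direct_idct(X):
    """Reference implementation of (9), used only for checking."""
    n = len(X)
    return [sum(scale(k, n) * X[k] * math.cos((2 * m + 1) * math.pi * k / (2 * n))
                for k in range(n)) for m in range(n)]
```

In hardware the $N$ calls to `idct_module` correspond to $N$ modules running in parallel on the same serial input; the list comprehension here simply serializes them.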

Fig.3  shows  the  IDCT  structure  based  on  the  Chebyshev  evaluation.  It  has  two  parts:  the  Reverse 
Array  (RA)  and  the  IDCT  module  array.  The  RA  consists  of  one  serial-input-parallel-output  (SIPO) 
register  array  and  one  parallel-input-serial-output  (PISO)  register  array.  It  provides  the  capability  of 
reversing  the  input  sequence  and  scaling  X(0)  in  a  fully  pipelined  way.  The  IDCT  module  performs 
(12)  at  different  index  n.  Since  n  varies  from  0  to  N  —  1,  we  need  N  IDCT  modules  to  compute  the 
IDCT  in  parallel.  The  whole  system  works  in  a  SIPO  way  and  requires  only  N  +  1  multipliers  and 
3N  adders  including  the  scaling  multiplier  in  RA.  The  number  of  multipliers  is  almost  as  low  as  that 
in  Hou’s  algorithm  [11].  Besides,  there  is  no  restriction  on  the  block  size  N  and  the  regularity  of  our 
IDCT  architecture  is  more  suitable  for  VLSI  implementation. 
2.2  Chebyshev  DCT  Architecture

The DCT of the time-domain block data $x(n)$, $n = 0, 1, \ldots, N-1$, is defined as

$$X(k) = C(k)\sum_{n=0}^{N-1} x(n)\cos\left[(2n+1)\frac{k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1. \tag{14}$$

As with the derivation of the IDCT algorithm, the DCT can be represented as

$$X(k) = C(k)\sum_{n=0}^{N-1} x(n)\cos[(2n+1)\omega_k] = C(k)\sum_{n=0}^{N-1} x(n)\,T_{2n+1}(\eta_k) \tag{15}$$

where $\omega_k = \frac{k\pi}{2N}$ and $\eta_k = \cos\omega_k$. Multiplying both sides of (15) by
$T_1(\eta_k)$ and using the Chebyshev property in (8), we obtain


$$T_1(\eta_k)\,X(k) = C(k)\sum_{n=0}^{N-1} x(n)\,\frac{1}{2}\left[T_{2n}(\eta_k) + T_{2n+2}(\eta_k)\right] = \frac{C(k)}{2}\sum_{n=0}^{N} x'(n)\,T_{2n}(\eta_k) \tag{16}$$

where

$$x'(n) = x(n-1) + x(n), \quad n = 0, 1, \ldots, N \tag{17}$$

with the assumption of $x(-1) = x(N) = 0$. Recall that $T_1(\eta_k) = \eta_k$ and
$T_{2n}(\eta_k) = T_n(T_2(\eta_k))$ (from (7)). If we define

$$\eta'_k = T_2(\eta_k) = \cos(2\omega_k) = 2\eta_k^2 - 1, \tag{18}$$

$X(k)$ in (16) can be computed as

$$X(k) = \frac{C(k)}{2\eta_k}\sum_{n=0}^{N} x'(n)\,T_n(\eta'_k), \quad k = 0, 1, \ldots, N-1. \tag{19}$$

Therefore, the DCT at center frequency $\omega_k$ can be obtained by evaluating the Chebyshev
series at the value $\eta'_k$ with coefficients $x'(n)$, $n = 0, 1, \ldots, N$, followed by the
scaling of $\frac{C(k)}{2\eta_k}$.

Note that the DCT of the reversed sequence $\tilde{x}(n) = x(N-1-n)$, $n = 0, 1, \ldots, N-1$, is

$$\tilde{X}(k) = C(k)\sum_{n=0}^{N-1}\tilde{x}(n)\cos[(2n+1)\omega_k] = C(k)\sum_{n=0}^{N-1} x(n)\cos\left[k\pi - \frac{(2n+1)k\pi}{2N}\right], \tag{20}$$

for $k = 0, 1, \ldots, N-1$. We can relate $\tilde{X}(k)$ to $X(k)$ by
$\tilde{X}(k) = (-1)^k X(k)$. As a result, the RA, which is used to reverse the input sequence,
can be eliminated by complementing the odd-indexed $\tilde{X}(k)$'s while keeping the
even-indexed $\tilde{X}(k)$'s unchanged. Fig. 4 shows the architecture to implement the
Chebyshev DCT algorithm. The overall Chebyshev DCT architecture needs a total of $2N - 2$
multipliers and $3N - 1$ adders. It should be noted that the total number of $x'(n)$ is $N + 1$.
Therefore, an extra zero is appended after $x(N-1)$ for the generation of $x'(n)$,
$n = 0, 1, \ldots, N$. After $x'(N)$ is sent to the DCT array, we can obtain the DCT coefficients
in parallel at the array outputs.
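The following sketch models (17)-(20) in software under stated assumptions. The samples $x'(n)$ are fed in natural order, so each module actually evaluates the series of the reversed block, and the sign of the odd-indexed outputs is flipped afterwards as described above. The `scale` helper (orthonormal scaling) and all function names are assumptions introduced for this example.

```python
import math

def scale(k, n):
    # ASSUMPTION (illustration only): orthonormal DCT-II scaling factor C(k).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def clenshaw_stream(samples, eta):
    """Clenshaw recurrence where the FIRST arriving sample is the highest-order
    coefficient and the LAST arriving sample plays the role of a_0."""
    b1 = b2 = 0.0
    for s in samples[:-1]:
        b1, b2 = s + 2.0 * eta * b1 - b2, b1
    b0 = 2.0 * samples[-1] + 2.0 * eta * b1 - b2
    return 0.5 * (b0 - b2)

def chebyshev_dct(x):
    n = len(x)
    padded = [0.0] + list(x) + [0.0]                 # x(-1) = x(N) = 0
    xp = [padded[m] + padded[m + 1] for m in range(n + 1)]   # x'(m), eq. (17)
    out = []
    for k in range(n):
        eta_k = math.cos(k * math.pi / (2 * n))
        eta_p = 2.0 * eta_k * eta_k - 1.0            # eta'_k, eq. (18)
        # Natural-order feeding computes the DCT of the reversed block, eq. (20).
        tilde = scale(k, n) * clenshaw_stream(xp, eta_p) / (2.0 * eta_k)
        out.append(-tilde if k % 2 else tilde)       # X(k) = (-1)^k * X~(k)
    return out

def direct_dct(x):
    """Reference implementation of (14), used only for checking."""
    n = len(x)
    return [scale(k, n) * sum(x[m] * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                              for m in range(n)) for k in range(n)]
```

Note that $\eta_k = \cos(k\pi/2N)$ is never zero for $k = 0, \ldots, N-1$, so the division by $2\eta_k$ in (19) is always defined.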

2.3  Low-Power  Design  for  the  DCT/IDCT 

Consider the Chebyshev series in (6) and split it into the even and odd series:

$$Y'_c(\eta) = \sum_{i=0}^{N/2-1} a_{2i}\,T_{2i}(\eta) + \sum_{i=0}^{N/2-1} a_{2i+1}\,T_{2i+1}(\eta) = Y_e(\eta) + Y_o(\eta) \tag{21}$$

where $Y_e(\eta)$ and $Y_o(\eta)$ denote the even and odd series, respectively. By the use of (2)
and (7), $Y_e(\eta)$ can be written as

$$Y_e(\eta) = \sum_{i=0}^{N/2-1} a_{2i}\,T_i(T_2(\eta)) = \sum_{i=0}^{N/2-1} a_{2i}\,T_i(\eta') \tag{22}$$

with $\eta' = 2\eta^2 - 1$. On the other hand, $Y_o(\eta)$ can be converted into an even series by
following the derivations in (15)-(19):


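$$Y_o(\eta) = \sum_{i=0}^{N/2} \kappa\,(a_{2i-1} + a_{2i+1})\,T_i(\eta'), \tag{23}$$

where $\kappa = \frac{1}{2\eta}$ is a pre-calculated constant coefficient and coefficients with
indices outside $0, \ldots, N-1$ are taken as zero. Now combining (22) and (23) together, we have

$$Y'_c(\eta) = \sum_{i=0}^{N/2}\left[a_{2i} + \kappa\,(a_{2i-1} + a_{2i+1})\right]T_i(\eta') = \sum_{i=0}^{N/2} d_i\,T_i(\eta') \tag{24}$$

with

$$d_i = \underbrace{a_{2i}}_{\text{even}} + \kappa\,(\underbrace{a_{2i-1} + a_{2i+1}}_{\text{odd}}), \quad i = 0, 1, \ldots, N/2. \tag{25}$$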

From (24) we can see that the evaluation of an $N$-point Chebyshev series can be reduced to an
$(N/2+1)$-point evaluation using the new sequence $d_i$, which is composed of decimated
sequences. This new evaluation method can be easily applied to the computation of the IDCT/DCT
described in Section 2.1 and Section 2.2. The resulting IDCT architecture is depicted in Fig. 5,
where $\kappa_n = 1/(2\eta_n)$ and $\eta'_n = 2\eta_n^2 - 1$, with $\eta_n$ defined in (13).
First, the $a_i$'s in (24) are replaced with the $X(i)$'s in (9), for $i = 0, 1, \ldots, N-1$;
then we use one decimation circuit and one adder to compute the even and odd sequences in (25)
from the $X(i)$ sequence in a fully pipelined way (see the left-hand side of Fig. 5). After these
two decimated sequences are reversed by the RA, they are combined to generate the $d_i$'s in
(25), and the $d_i$'s are sent to the IDCT module to perform the Chebyshev evaluation in (24).
Once the evaluation is completed, we have the IDCT coefficient with index $n$ at the module
output. Since the operating frequency halves after the decimator, we can now use multipliers and
adders that are two times slower in this IDCT module, at the cost of some hardware overhead.
Meanwhile, the throughput rate is retained. Similarly, the multirate Chebyshev DCT architecture
can be derived as shown in Fig. 6, where $\kappa_k = 1/(2\eta'_k)$, $\eta''_k = 2\eta'^2_k - 1$,
and $\eta'_k$ is defined in (18).
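A small numerical check of (24)-(25) may help. The sketch below forms the half-length sequence $d_i$ from the coefficients $a_k$ and confirms that the shorter series at $\eta'$ matches the original series at $\eta$. The function names are hypothetical, and an even $N$ and $\eta \neq 0$ are assumed.

```python
import math

def decimated_coefficients(a, eta):
    """Return (d, eta') for eqs. (24)-(25); assumes len(a) is even and eta != 0."""
    n = len(a)
    kappa = 1.0 / (2.0 * eta)                      # pre-calculated constant

    def coef(j):                                   # out-of-range coefficients are zero
        return a[j] if 0 <= j < n else 0.0

    d = [coef(2 * i) + kappa * (coef(2 * i - 1) + coef(2 * i + 1))
         for i in range(n // 2 + 1)]
    return d, 2.0 * eta * eta - 1.0

def chebyshev_series(c, eta):
    """sum_i c[i] * T_i(eta), with T_i generated by the three-term recurrence (2)."""
    t_prev, t_cur = 1.0, eta                       # T_0, T_1
    total = 0.0
    for ci in c:
        total += ci * t_prev
        t_prev, t_cur = t_cur, 2.0 * eta * t_cur - t_prev
    return total
```

The recurrence in `chebyshev_series` runs for about half as many steps on `d` as on `a`, which is the software counterpart of running the module at half the clock rate.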

To achieve downsampling by four, we can recursively compute another new $(N/4+1)$-point sequence
$e_i$ from $d_i$, which results in

$$e_i = \kappa'\kappa\left[(a_{4i-3} + a_{4i+1}) + (a_{4i-1} + a_{4i+3})\right] + \kappa'(a_{4i-2} + a_{4i+2}) + \kappa\,(a_{4i-1} + a_{4i+1}) + a_{4i}, \tag{26}$$

for $i = 0, 1, \ldots, N/4$, where $\kappa' = \frac{1}{2\eta'}$ is also a pre-computed constant.
One possible realization of (26) is depicted in Fig. 7. Once the $e_i$'s are computed from the
decimated sequences $a_{4i+k}$, $k = 0, 1, 2, 3$, the evaluation of $Y'_c(\eta)$ can be computed as

$$Y'_c(\eta) = \sum_{i=0}^{N/4} e_i\,T_i(\eta'') \tag{27}$$

with $\eta'' = 2\eta'^2 - 1$. Likewise, based on (26) and (27), we can also construct the
multirate IDCT and DCT architectures as shown in Fig. 8 and Fig. 9, in which only operators that
are four times slower are required to compute the transform coefficients.
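The closed form (26) can be cross-checked against two successive applications of (25). The sketch below does this numerically; it assumes $N$ is a multiple of four and that neither $\eta$ nor $\eta'$ is zero, and the function names are illustrative only.

```python
import math

def decimate(c, eta):
    """One decimation step, eq. (25): returns the new coefficients and 2*eta^2 - 1."""
    kappa = 1.0 / (2.0 * eta)

    def coef(j):
        return c[j] if 0 <= j < len(c) else 0.0

    out = [coef(2 * i) + kappa * (coef(2 * i - 1) + coef(2 * i + 1))
           for i in range(len(c) // 2 + 1)]
    return out, 2.0 * eta * eta - 1.0

def e_explicit(a, eta):
    """Closed-form e_i of eq. (26), written directly in terms of the a_k."""
    k1 = 1.0 / (2.0 * eta)                  # kappa
    eta1 = 2.0 * eta * eta - 1.0            # eta'
    k2 = 1.0 / (2.0 * eta1)                 # kappa'

    def c(j):
        return a[j] if 0 <= j < len(a) else 0.0

    return [k2 * k1 * ((c(4 * i - 3) + c(4 * i + 1)) + (c(4 * i - 1) + c(4 * i + 3)))
            + k2 * (c(4 * i - 2) + c(4 * i + 2))
            + k1 * (c(4 * i - 1) + c(4 * i + 1))
            + c(4 * i)
            for i in range(len(a) // 4 + 1)]
```

Applying `decimate` twice and evaluating the resulting series at $\eta''$ is one way to confirm (27) for a particular input.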

2.3.1  Power  Estimation  for  the  Low-Power  Design 

Now  let  us  consider  the  power  dissipation  of  the  low-power  architectures.  The  power  dissipation  in 
a  well-designed  digital  CMOS  circuit  can  be  modeled  as  [12] 


$$P \approx C_{eff}\cdot V_{dd}^2\cdot f_{clk}, \tag{28}$$

where $C_{eff}$ is the effective loading capacitance, $V_{dd}$ is the supply voltage, and
$f_{clk}$ is the operating frequency. Also, the lowest possible supply voltage $V'_{dd}$ can be
approximated by [1][13]

$$\frac{V'_{dd}}{(V'_{dd} - V_t)^2} = M\,\frac{V_{dd}}{(V_{dd} - V_t)^2} \tag{29}$$

where $M$ is the decimation factor and $V_t$ is the threshold voltage of the device.

Assume that $V_{dd} = 5$ V and $V_t = 0.7$ V in the original system. The 16-point Chebyshev IDCT
under normal operation requires 18 multipliers and 48 adders. For the low-power 16-point IDCT
with $M = 2$, 34 multipliers and 65 adders are required. From (29), it can be shown that
$V'_{dd}$ can be as low as 3.1 V for the case $M = 2$. Provided that the capacitance due to the
multipliers is dominant in the circuit and is roughly proportional to the number of multipliers,
the power consumption of the low-power design can be estimated as

$$\left(\frac{34}{18}\,C_{eff}\right)(3.1\,\mathrm{V})^2\left(\frac{1}{2}f_{clk}\right) \approx 0.36\,P_0, \tag{30}$$

where $P_0 = C_{eff}\,(5\,\mathrm{V})^2 f_{clk}$ denotes the power consumption of the original
system. Similarly, for the case $M = 4$, the 16-point IDCT needs a total of 66 multipliers and
100 adders. Since the lowest possible supply voltage can be 2.1 V (from (29)), the total power
can be reduced to

$$\left(\frac{66}{18}\,C_{eff}\right)(2.1\,\mathrm{V})^2\left(\frac{1}{4}f_{clk}\right) \approx 0.16\,P_0. \tag{31}$$

Therefore, we can achieve low power consumption at the expense of a reasonable complexity
overhead. Such a tradeoff will be considered in Section 4.
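The arithmetic behind these estimates can be reproduced with a short script. The sketch below assumes the delay model $V/(V - V_t)^2$ implied by (29) and the multiplier-dominated capacitance assumption stated above; the function names and the bisection solver are choices made for this example.

```python
def scaled_supply(v_dd, v_t, m):
    """Solve V'/(V' - Vt)^2 = m * Vdd/(Vdd - Vt)^2 for V' in (Vt, Vdd] by bisection."""
    target = m * v_dd / (v_dd - v_t) ** 2
    lo, hi = v_t + 1e-9, v_dd
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # V/(V - Vt)^2 decreases as V grows above Vt, so a value that is
        # too large means the trial voltage is still too low.
        if mid / (mid - v_t) ** 2 > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power_ratio(mult_new, mult_old, v_new, v_old, m):
    """P / P0 from (28): capacitance ratio * voltage ratio squared * clock ratio."""
    return (mult_new / mult_old) * (v_new / v_old) ** 2 / m

v2 = scaled_supply(5.0, 0.7, 2)   # supply voltage for M = 2
v4 = scaled_supply(5.0, 0.7, 4)   # supply voltage for M = 4
r2 = power_ratio(34, 18, v2, 5.0, 2)
r4 = power_ratio(66, 18, v4, 5.0, 4)
```

With the multiplier counts quoted in the text, `v2` and `v4` come out close to the 3.1 V and 2.1 V figures, and `r2` is close to the 0.36 ratio of (30).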


3  The  Polyphase  Decomposition  Approach 

Performing orthogonal transforms based on the IIR transfer function approach was studied in [3].
By considering the transform operator as a linear shift-invariant (LSI) system that maps the
serial input data into their transform coefficients, the authors in [3] have shown that most
discrete sinusoidal transforms can be realized by using a unified IIR structure. In this section,
we will show that, in addition to the Chebyshev approach, we can also derive the multirate
low-power DCT/IDCT algorithms/architectures by applying the polyphase decomposition [5] to the
IIR transfer functions in [3]. We will see later that the polyphase decomposition approach
provides a systematic way for architectural low-power design.


3.1  The  IIR  DCT  Algorithm

The one-dimensional (1-D) DCT of a series of input data starting from $x(t-N+1)$ and ending at
$x(t)$ is defined as

$$X_{DCT,k}(t) = C(k)\sum_{n=0}^{N-1}\cos\left[(2n+1)\frac{k\pi}{2N}\right]x(t+n-N+1), \tag{32}$$

for $k = 0, 1, 2, \ldots, N-1$. A second-order IIR transfer function can be derived from (32) as [3]

$$H_{DCT,k}(z) = \frac{X_{DCT,k}(z)}{X(z)} = \frac{\left((-1)^k - z^{-N}\right)C(k)\cos\omega_k\,(1 - z^{-1})}{1 - 2\cos 2\omega_k\,z^{-1} + z^{-2}} \tag{33}$$

where $\omega_k = \frac{k\pi}{2N}$, and $X_{DCT,k}(z)$ and $X(z)$ denote the $z$-transforms of
$X_{DCT,k}(t)$ and $x(t)$, respectively. For block processing, the $z^{-N}$ in (33) can be
eliminated because of the reset operation every $N$ cycles. The corresponding IIR structure to
compute the $k$th frequency component of the DCT is shown in Fig. 10, in which

$$\Gamma_c(m) = (-1)^k\,C(k)\cos m\omega_k. \tag{34}$$


Once  the  last  serial  input  x(t)  is  fed  into  the  module,  the  DCT  coefficients  can  be  obtained  at 
the  module  outputs  in  parallel.  The  resulting  parallel  architecture  is  regular,  modular,  and  fully- 
pipelined.  Also,  the  SIPO  feature  can  avoid  the  input  buffers  as  well  as  the  index  mapping  operation 
that  are  required  in  most  PIPO  DCT  architectures  [11]  [14].  One  disadvantage  of  the  IIR  structure 
in  Fig.10  is  that  the  operation  speed  is  constrained  by  the  recursive  loops.  In  what  follows,  we 
will  reformulate  the  transfer  function  using  the  multirate  approach  so  that  speed  constraint  can  be 
alleviated. 
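To make the block-processing form of (33) concrete, the sketch below simulates one second-order IIR module per frequency index $k$, resetting the state at the start of each block and reading the output after the last sample. It is a behavioral model only; the `scale` helper (orthonormal scaling) and the function names are assumptions made for this example.

```python
import math

def scale(k, n):
    # ASSUMPTION (illustration only): orthonormal DCT-II scaling factor C(k).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def iir_dct(x):
    """Serial-input, parallel-output DCT using the block form of eq. (33)."""
    n = len(x)
    out = []
    for k in range(n):                     # one module per frequency index k
        w = k * math.pi / (2 * n)
        gain = (-1) ** k * scale(k, n) * math.cos(w)   # Gamma_c(1) in eq. (34)
        y1 = y2 = 0.0                      # recursive state, reset for every block
        prev = 0.0                         # x(t-1)
        for sample in x:
            y = 2.0 * math.cos(2 * w) * y1 - y2 + gain * (sample - prev)
            y1, y2, prev = y, y1, sample
        out.append(y1)                     # value after the last input sample
    return out

def direct_dct(x):
    """Reference implementation of the block DCT, used only for checking."""
    n = len(x)
    return [scale(k, n) * sum(x[m] * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                              for m in range(n)) for k in range(n)]
```

The inner loop is the recursive path that limits the clock rate: each new output depends on the two previous ones, so it cannot be pipelined without reformulating the transfer function.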


3.2  Low-Power  Design  of  the  IIR  DCT 

Splitting the input data sequence into the even sequence

$$x_e(t,n) = x(t + 2n - N + 1), \quad n = 0, 1, \ldots, N/2 - 1 \tag{35}$$

and the odd sequence

$$x_o(t,n) = x(t + 2n - N + 2), \quad n = 0, 1, \ldots, N/2 - 1, \tag{36}$$

(32) becomes

$$X_{DCT,k}(t) = C(k)\sum_{n=0}^{N/2-1}\cos\left[(4n+1)\frac{k\pi}{2N}\right]x_e(t,n) + C(k)\sum_{n=0}^{N/2-1}\cos\left[(4n+3)\frac{k\pi}{2N}\right]x_o(t,n). \tag{37}$$

Taking the $z$-transform on both sides of (37) and rearranging, we have

$$X_{DCT,k}(z) = \frac{(-1)^k\,C(k)}{1 - 2\cos 4\omega_k\,z^{-1} + z^{-2}}\left(\left[X_e(z) - X_o(z)z^{-1}\right]\cos 3\omega_k + \left[X_o(z) - X_e(z)z^{-1}\right]\cos\omega_k\right) \tag{38}$$

where $X_e(z)$ and $X_o(z)$ are the $z$-transforms of $x_e(t,n)$ and $x_o(t,n)$, respectively.
The parallel architecture to realize (38) is depicted in Fig. 11. The common circuit at the
left-hand side decimates the input serial data into the even and odd sequences and generates the
common inputs for the module array. The numerator part and the denominator part of (38) are
realized by the FIR structure and the IIR structure inside each DCT module at different index
$k$. The overall architecture requires $(3N - 3)$ multipliers and $(3N + 1)$ adders plus a
decimation circuit. Compared with the IIR DCT structure in Fig. 10, this multirate DCT structure
needs only $(N - 1)$ extra multipliers and $(N + 1)$ extra adders.
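A behavioral sketch of (38) follows: the block is split into its even and odd subsequences, and each module runs its recursion once per pair of input samples, that is, at half the input rate. As before, the `scale` helper (orthonormal scaling) and the function names are assumptions for illustration, and an even block size is assumed.

```python
import math

def scale(k, n):
    # ASSUMPTION (illustration only): orthonormal DCT-II scaling factor C(k).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def multirate_iir_dct(x):
    """DCT from the decimated sequences, eq. (38); len(x) is assumed even."""
    n = len(x)
    xe, xo = x[0::2], x[1::2]              # even and odd sequences, eqs. (35)-(36)
    out = []
    for k in range(n):
        w = k * math.pi / (2 * n)
        g = (-1) ** k * scale(k, n)
        c1, c3 = math.cos(w), math.cos(3 * w)
        y1 = y2 = 0.0                      # IIR state, reset for every block
        pe = po = 0.0                      # previous even / odd samples
        for e, o in zip(xe, xo):           # N/2 iterations instead of N
            fir = (e - po) * c3 + (o - pe) * c1        # numerator of (38)
            y = 2.0 * math.cos(4 * w) * y1 - y2 + g * fir
            y1, y2, pe, po = y, y1, e, o
        out.append(y1)
    return out

def direct_dct(x):
    """Reference implementation of the block DCT, used only for checking."""
    n = len(x)
    return [scale(k, n) * sum(x[m] * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                              for m in range(n)) for k in range(n)]
```

Compared with a full-rate second-order recursion, the feedback loop here is executed half as often, at the price of the extra FIR terms in `fir`.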

To achieve downsampling by the factor of four, we can split the input data sequence into four
decimated sequences

$$g_i(t,n) = x(t + (4n + i) - N + 1), \quad i = 0, 1, 2, 3, \tag{39}$$

for $n = 0, 1, \ldots, N/4 - 1$. Following the derivations in (37)-(38), we can write
$X_{DCT,k}(z)$ as

$$\begin{aligned} X_{DCT,k}(z) = \frac{(-1)^k\,C(k)}{1 - 2\cos 8\omega_k\,z^{-1} + z^{-2}}\Big( &\left[G_0(z) - G_3(z)z^{-1}\right]\cos 7\omega_k + \left[G_1(z) - G_2(z)z^{-1}\right]\cos 5\omega_k \\ &+ \left[G_2(z) - G_1(z)z^{-1}\right]\cos 3\omega_k + \left[G_3(z) - G_0(z)z^{-1}\right]\cos\omega_k\Big), \end{aligned} \tag{40}$$

where $G_i(z)$ is the $z$-transform of $g_i(t,n)$, $i = 0, 1, 2, 3$. The corresponding multirate
architecture is shown in Fig. 12.

From Fig. 11 and Fig. 12, we can see that the multirate DCT architectures retain all advantages
of the original IIR structure in [3], such as modularity, regularity, and local
interconnections. It is also interesting to note that the hardware overhead grows only locally
rather than globally, and the DCT architecture with $M = 4$ can be generated by reusing the
modules in the $M = 2$ design (e.g., the FIR/IIR structures and the lattice structure in the
common circuit). Therefore, neither global routing nor new module design is required in the
$M = 4$ case. The characteristics of scalability, modularity, and local interconnections make the
multirate structures very suitable for VLSI implementation. Unlike most PIPO DCT algorithms, in
which the interconnections take up much of the chip area, the local communication of our design
can greatly reduce the power dissipation in the routing area. From the discussions in Section
2.3, it can be shown that the total power consumption for the multirate 16-point DCT can be
reduced to $0.29P_0$ for $M = 2$ and $0.11P_0$ for $M = 4$. The significant power savings for
the design with $M = 4$ are achieved only at the cost of $3N - 3$ extra multipliers and $3N + 3$
extra adders.


3.3  Low-Power  Design  of  the  IIR  IDCT

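The IIR transfer function for the block IDCT is given by [3]

$$H_{IDCT,n}(z) = \frac{(-1)^n\,C(1)\sin\omega_n}{1 - 2\cos\omega_n\,z^{-1} + z^{-2}} + \left(C(0) - C(1)\right)z^{-(N-1)} \tag{41}$$

where $\omega_n = \frac{(2n+1)\pi}{2N}$. As with the derivations of the low-power IIR DCT, the
multirate transfer function for the IDCT with $M = 2$ can be derived as

$$X_{IDCT,n}(z) = \frac{(-1)^n\,C(1)}{1 - 2\cos 2\omega_n\,z^{-1} + z^{-2}}\left(X_e(z)\sin 2\omega_n + (1 + z^{-1})X_o(z)\sin\omega_n\right) + \left(C(0) - C(1)\right)z^{-(N-1)}X(z). \tag{42}$$

Similarly, the transfer function for $M = 4$ is

$$\begin{aligned} X_{IDCT,n}(z) = \frac{(-1)^n\,C(1)}{1 - 2\cos 4\omega_n\,z^{-1} + z^{-2}}\Big( &G_0(z)\sin 4\omega_n + \left[G_1(z) + G_3(z)z^{-1}\right]\sin 3\omega_n \\ &+ (1 + z^{-1})G_2(z)\sin 2\omega_n + \left[G_3(z) + G_1(z)z^{-1}\right]\sin\omega_n\Big) \\ &+ \left(C(0) - C(1)\right)z^{-(N-1)}X(z). \end{aligned} \tag{43}$$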
The corresponding low-power IIR IDCT structures based on (42) and (43) are shown in Fig. 13 and
Fig. 14, respectively, where the multiplier coefficient is defined as

$$\Gamma_s(m) = (-1)^n\,C(1)\sin m\omega_n. \tag{44}$$

As we can see, the low-power IDCT design has a structure similar to that of the low-power DCT,
except for a small difference in the common circuit. Therefore, it is possible to integrate both
the forward and backward transforms into one architecture by suitably multiplexing the data path
in the common circuit and the coefficients inside the modules.

3.4  Polyphase  Representation 

In the preceding discussions, we have shown how to perform the multirate DCT/IDCT by rearranging
the $z$-transforms of the decimated sequences. Here we will show a systematic way to derive the
results by applying the polyphase decomposition to the original IIR transfer function.

Substitute the identity

$$\frac{1}{1 - 2\cos 2\omega_k\,z^{-1} + z^{-2}} = \frac{1 + 2\cos 2\omega_k\,z^{-1} + z^{-2}}{1 - 2\cos 4\omega_k\,z^{-2} + z^{-4}} \tag{45}$$

into the IIR DCT transfer function in (33). After rearranging, $H_{DCT,k}(z)$ under block
processing can be written as

$$H_{DCT,k}(z) = \frac{(-1)^k\,C(k)}{D(z^2)}\left[H_0(z^2) + z^{-1}H_1(z^2)\right] \tag{46}$$

where

$$\begin{aligned} D(z^2) &= 1 - 2\cos 4\omega_k\,z^{-2} + z^{-4}, \\ H_0(z^2) &= \cos\omega_k - \cos 3\omega_k\,z^{-2}, \\ H_1(z^2) &= \cos 3\omega_k - \cos\omega_k\,z^{-2}. \end{aligned} \tag{47}$$

(46) is the polyphase representation of $H_{DCT,k}(z)$ with $M = 2$, and its corresponding
polyphase implementation is shown in Fig. 15(a). The downsampling operation $\downarrow N$ at the
right end denotes that we pick up the DCT coefficients at the $N$th clock cycle and ignore all
the previous intermediate results. Given this polyphase implementation, we can use the noble
identities [5] to distribute the downsampling operation towards the left and obtain Fig. 15(b),
which leads to the multirate DCT architecture in Fig. 11. Thus, we can process the input data at
a clock rate that is two times slower. After $N/2$ iterations, the DCT coefficients are available
at the output ends. Similarly, $M = 4$ can be achieved by performing another polyphase
decomposition on $D(z^2)$ in (46). After some algebraic simplifications, we can obtain (40), and
its corresponding implementation allows us to operate at a clock rate that is four times slower.
The polyphase decomposition can also be used to derive the results for the multirate IDCT. In the
companion paper [8], we will apply this methodology to obtain the low-power architecture of
logarithmic complexity as well as the unified transformation module design.
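The identity (45) and the polyphase components in (47) can be checked numerically by multiplying polynomials in $z^{-1}$. The sketch below represents each polynomial as a list of coefficients in increasing powers of $z^{-1}$; the function names are hypothetical and the common factor $(-1)^k C(k)$ is left out.

```python
import math

def polymul(p, q):
    """Multiply two polynomials in z^{-1} given as coefficient lists."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

def polyphase_terms(w):
    """Return (new denominator, new numerator) after applying identity (45)."""
    c2 = math.cos(2 * w)
    den = [1.0, -2.0 * c2, 1.0]            # 1 - 2cos(2w) z^-1 + z^-2
    conj = [1.0, 2.0 * c2, 1.0]            # 1 + 2cos(2w) z^-1 + z^-2
    new_den = polymul(den, conj)           # expected: D(z^2) in (47)
    # Block-form numerator of (33) without the factor (-1)^k C(k).
    new_num = polymul([math.cos(w), -math.cos(w)], conj)
    return new_den, new_num
```

In `new_num`, the even-power coefficients form $H_0(z^2)$ and the odd-power coefficients form $H_1(z^2)$, which is exactly the split used in (46).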

4  Comparisons  of  Architectures 

In  this  section,  we  would  like  to  discuss  the  hardware  complexity  of  the  two  algorithm-based  low- 
power  approaches  (the  Chebyshev  polynomial  approach  and  the  polyphase  decomposition  approach) 
proposed  in  this  paper.  Also,  we  will  compare  the  proposed  multirate  SIPO  architectures  with  the 
existing  SIPO  and  PIPO  architectures  [3]  [14].  Table  1  summarizes  the  hardware  cost  for  all  the 
proposed architectures under normal operation and under multirate operation (M = 2, 4). As we
can see, the hardware overhead that the low-power design pays for speed compensation grows only
linearly with N. As to the two approaches (Chebyshev and polyphase), the Chebyshev IDCT requires
(N - 1) fewer multipliers than the HR IDCT in both normal and multirate operation. This saving is
particularly attractive for applications that require a cost-effective IDCT, such as HDTV receivers.
In contrast, the Chebyshev DCT has almost the same complexity as the HR DCT. Since the
Chebyshev DCT needs one more iteration to finish the transform, the polyphase HR DCT is the better
choice for implementation.
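
The operator counts of Table 1 can be tabulated for a concrete block size. The sketch below encodes each entry as a pair (a, b) meaning aN + b, as read from Table 1. The dictionary layout, the function name `operators`, and the use of M = 1 for normal operation are illustrative choices rather than notation from the paper.

```python
# (multiplier, adder) counts from Table 1, each stored as (a, b) for a*N + b.
# Key 1 stands for normal operation; keys 2 and 4 are the decimation factors.
COST = {
    "Chebyshev DCT":  {1: ((2, -2), (3, -1)), 2: ((3, -3), (4, 0)), 4: ((5, -5), (6, 3))},
    "Chebyshev IDCT": {1: ((1, 2), (3, 0)),   2: ((2, 2), (4, 1)),  4: ((4, 2), (6, 4))},
    "HR DCT":         {1: ((2, -2), (2, 0)),  2: ((3, -3), (3, 1)), 4: ((5, -5), (5, 3))},
    "HR IDCT":        {1: ((2, 1), (3, 0)),   2: ((3, 1), (4, 1)),  4: ((5, 1), (6, 2))},
}

def operators(arch, M, N):
    """Return (multipliers, adders) for architecture `arch` at factor M, size N."""
    (ma, mb), (aa, ab) = COST[arch][M]
    return ma * N + mb, aa * N + ab

N = 8
for M in (1, 2, 4):
    saved = operators("HR IDCT", M, N)[0] - operators("Chebyshev IDCT", M, N)[0]
    print(M, operators("HR DCT", M, N), saved)   # saved equals N - 1 for every M
```

For N = 8 the multiplier saving of the Chebyshev IDCT is 7 in all three operating modes, in agreement with the (N - 1) figure quoted above.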

Next, we would like to compare our low-power DCT architecture with those proposed in [14]
and [3]. The architecture in [14], which utilizes a factorization method to perform the fast DCT, is a
typical representative of the PIPO fast algorithms. The HR structure proposed in [3], on the other
hand,  is  a  good  example  of  the  SIPO  algorithms.  A  comparison  regarding  their  inherent  properties 
is  listed  in  Table  2.  The  advantages  of  the  SIPO  approach  over  the  PIPO  approach  in  their  VLSI 
implementation,  such  as  local  communication  and  linear  hardware  complexity  increase,  have  been 
discussed  thoroughly  in  [2]  and  [3].  Nevertheless,  when  the  speed  compensation  capability  is  of 
concern, the PIPO approach is also a good choice, since block PIPO processing with block size N is
equivalent to decimating the input data by a factor of N. However, this advantage is obtained at the
price of globally increased hardware and routing paths. Besides, the block size is usually restricted
to a power of two due to the "divide-and-conquer" nature of those PIPO fast algorithms. From
Table 2, we can see that our multirate SIPO approach is a good compromise between the other
two  approaches.  Basically,  the  multirate  approach  inherits  all  the  advantages  of  the  existing  SIPO 
approach;  meanwhile,  it  can  compensate  the  speed  penalty  at  the  expense  of  “locally”  increased 
hardware  and  routing,  which  is  not  the  case  in  the  PIPO  approach.  Although  some  restriction  is 
imposed  on  the  data  size  N  due  to  the  downsampling  operation,  i.e., 

$N = Mk, \quad k \in \mathbb{Z}^{+}$  (48)

(M is the decimation factor and $\mathbb{Z}^{+}$ denotes the set of positive integers), the choice of N is much more
flexible than with the PIPO algorithms.
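
To illustrate the difference in flexibility, the following sketch lists the block sizes up to 32 admitted by restriction (48) with M = 4 and those admitted by a power-of-two restriction. The function names are hypothetical.

```python
def multirate_ok(N, M):
    """Restriction (48): N must be a positive multiple of the decimation factor M."""
    return N >= M and N % M == 0

def pipo_ok(N):
    """Power-of-two restriction: N = 2^k with k a positive integer."""
    return N >= 2 and (N & (N - 1)) == 0

multirate_sizes = [N for N in range(1, 33) if multirate_ok(N, 4)]
pipo_sizes = [N for N in range(1, 33) if pipo_ok(N)]
print(multirate_sizes)   # [4, 8, 12, 16, 20, 24, 28, 32]
print(pipo_sizes)        # [2, 4, 8, 16, 32]
```

Sizes such as N = 12 or N = 20 are admissible for the multirate design with M = 4 but not for a power-of-two fast algorithm.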

Another advantage of the SIPO approaches lies in the computation of the pruning DCT [15].
In DCT-based signal compression algorithms, the most useful information of the signal is kept
in the low-frequency DCT components. Therefore, retaining only $N_0 < N$ coefficients is sufficient
for lossy data compression. Although the pruning DCT can be computed from the PIPO DCT
architecture by removing the unnecessary data paths and computational operators [15], global
communication remains the major drawback of its implementation as N increases. In contrast,
the SIPO architecture in [3] and our low-power design can be readily applied to the pruning DCT by
simply implementing the first $N_0$ DCT modules for the computation of the first $N_0$ DCT coefficients.
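
The pruning property can be sketched in software by treating each DCT module as an independent serial-input unit. The code below is an illustration under stated assumptions, not the module of [3]. It assumes a second-order recursion $y[n] = 2\cos(2\omega_k)\, y[n-1] - y[n-2] + \cos\omega_k\,(x[n] - x[n-1])$ with $\omega_k = k\pi/(2N)$, a final sign factor $(-1)^k$, and an unnormalized DCT. The names `dct_module` and `pruned_dct` are hypothetical.

```python
import math

def dct_module(x, k):
    """One serial-input module: k-th unnormalized DCT coefficient of x."""
    N = len(x)
    w = k * math.pi / (2 * N)              # omega_k (assumed definition)
    c, g = math.cos(w), 2 * math.cos(2 * w)
    y1 = y2 = x1 = 0.0                     # y[n-1], y[n-2], x[n-1]
    for xn in x:                           # one sample per clock cycle
        y = g * y1 - y2 + c * (xn - x1)
        y2, y1, x1 = y1, y, xn
    return (-1) ** k * y1                  # result read out at the Nth cycle

def pruned_dct(x, n0):
    """Pruning DCT: instantiate only the first n0 modules."""
    return [dct_module(x, k) for k in range(n0)]

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
print(pruned_dct(x, 3))                    # the three lowest-frequency coefficients
```

Because the modules share nothing but the serial input, dropping the modules for k >= N_0 requires no change to the remaining ones.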

5  Conclusions 

In  this  paper,  we  presented  the  algorithm-based  low-power  design  of  the  transform  coding  kernels 
based  on  the  multirate  approach.  We  have  shown  that  by  either  exploiting  the  properties  of  the 
Chebyshev  polynomial  or  reformulating  the  HR  DCT/IDCT  algorithms,  we  can  reduce  the  operating 
frequency  of  the  transforms  at  the  architectural  level  without  degrading  the  system  throughput  rate. 
Such  compensation  capability  will  lead  to  drastic  savings  in  the  total  power  consumption.  Therefore, 
the  proposed  low-power  transform  coding  kernels  will  be  effective  for  the  low-power/high-performance 
signal  processing  systems. 

It is worth mentioning that since the multirate approach is derived at the word level, other
arithmetical-level  techniques,  such  as  bit-level  downsampling  [16]  and  distributed  arithmetic  [17] [18], 
can  be  employed  in  the  VLSI  implementation  to  further  reduce  the  power  consumption.  In  general, 
they  do  not  explicitly  exploit  the  inherent  properties  of  the  orthogonal  transforms.  As  a  result,  they 
achieve  the  speed  compensation  at  the  arithmetical  level  rather  than  at  the  algorithmic  level. 

Another attractive application of our design is very high-speed signal processing. Presently,
in most VLSI implementations of the orthogonal transforms, the input data rate is limited by the
speed of the adders and multipliers in the circuit. In video-rate applications such as HDTV,
the speed constraint will result in the use of expensive high-speed multiplier/adder circuits or
full-custom design. Thus, the manufacturing cost as well as the design cycle will increase drastically. By
employing the multirate parallel architectures discussed in this paper, the speed constraint can be
resolved at the architectural level with the same design environment and fabrication technology. For
example, if we want to perform the DCT on serial data at 200 MHz, we may use the parallel architecture
in Fig. 12, in which only 50 MHz adders and multipliers are required. Therefore, we can perform very
high-speed DCT by using only low-cost and low-speed processing elements.

Architecture      Normal operation        Downsampling by 2       Downsampling by 4       Extra
                                          (M = 2)                 (M = 4)                 iteration
                  Multiplier  Adder       Multiplier  Adder       Multiplier  Adder
Chebyshev DCT     2N - 2      3N - 1      3N - 3      4N          5N - 5      6N + 3      Yes
Chebyshev IDCT    N + 2       3N          2N + 2      4N + 1      4N + 2      6N + 4      No
HR DCT            2N - 2      2N          3N - 3      3N + 1      5N - 5      5N + 3      No
HR IDCT           2N + 1      3N          3N + 1      4N + 1      5N + 1      6N + 2      No

Table  1:  Comparison  of  hardware  cost  for  the  DCT  and  IDCT  architectures  with  their  low-power 
designs  in  terms  of  2-input  multipliers  and  2-input  adders. 


                                  Liu et al. [3]   Proposed multirate HR DCT      Lee [14]
                                                   with M = 4
Data processing rate              f_s              f_s/M                          f_s/N
No. of multipliers                2N - 2           (M + 1)N (in order)            [illegible] log2 N (in order)
No. of adders                     2N               (M + 1)N (in order)            [illegible]
Latency                           N                N                              [log2 N (log2 N - 1)]/2
Restriction on transform size N   No               Mk, k ∈ Z+                     2^k, k ∈ Z+
Requirement for input buffer      No               No                             Yes
Index mapping                     No               No                             Yes
Communication                     Local            Local                          Global
I/O operation                     SIPO             SIPO                           PIPO
Speed compensation capability     N/A              Good (at the expense of        Good (at the expense of
                                                   locally increased hardware     globally increased hardware
                                                   overhead and local routing)    overhead and global routing)
Power consumption in routing      Negligible       Negligible                     Noticeable as N increases
Application to pruning DCT        Direct           Direct                         Needs many modifications
                                                                                  and global interconnections

Table  2:  Comparisons  of  different  DCT  architectures,  where  fs  denotes  the  data  sample  rate,  M 
denotes  the  programmable  downsampling  factor,  and  N  is  the  block  size. 

Figure 1: (a) Original SIPO DCT circuit, (b) Low-power DCT circuit using the multirate approach.


Figure  2:  Recursive  architecture  to  evaluate  Yj.{rj). 

Figure 3: Parallel Chebyshev IDCT architecture.

Figure 4: Parallel Chebyshev DCT architecture.

Figure 5: Low-power parallel Chebyshev IDCT architecture with decimation factor of two.

Figure 6: Low-power parallel Chebyshev DCT architecture with decimation factor of two.

Figure 7:  Evaluation  of  e;  using  the  downsampling  circuit.


Figure  8:  Low-power  parallel  Chebyshev  IDCT  architecture  with  decimation  factor  of  four,  where 
Vn  =  2 (v'n)2  -  1  and  K'n  =  1/(2 Vn)- 

Figure 11: Low-power polyphase HR DCT architecture with M = 2.

Figure 15: (a) Polyphase representation of $H_{DCT,k}(z)$. (b) Polyphase representation of $H_{DCT,k}(z)$
after applying the noble identity.


